Let’s get started!
Loading in packages, removing objects from the environment and setting the working directory.
if (!require("pacman")) install.packages("pacman"); library(pacman)
pacman::p_load(tidyverse, caTools, ggthemes, janitor, corrplot, GGally, psych)
rm(list=ls())
setwd("/Users/Jesse/R/MedInsuranceProject/")
Reading in the data.
df <- read_csv("/Users/Jesse/R/MedInsuranceProject/insurance.csv")
Let’s take a took at the data:
head(df)
## # A tibble: 6 x 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
str(df)
## spec_tbl_df [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr [1:1338] "female" "male" "male" "male" ...
## $ bmi : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
## $ children: num [1:1338] 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr [1:1338] "yes" "no" "no" "no" ...
## $ region : chr [1:1338] "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_character(),
## .. bmi = col_double(),
## .. children = col_double(),
## .. smoker = col_character(),
## .. region = col_character(),
## .. charges = col_double()
## .. )
Let’s change the character columns to factors:
df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)
Finally, let’s check for any missing values:
any(is.na(df))
## [1] FALSE
No missing data! Let’s move on to some exploratory data analysis.
summary(df)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
Let’s take a look at a histogram of the insurance charges:
df %>% ggplot(aes(charges)) + geom_histogram(aes(y=stat(density)), fill = "light blue", color = "white", bins = 30) +
ggtitle("Distribution of Charges") + theme_bw() + theme(plot.title = element_text(hjust = .5)) +
geom_density(col="dark blue")
This distribution is right-skewed. Let’s make the distribution closer to normal:
df %>% ggplot(aes(log10(charges))) + geom_histogram(aes(y=stat(density)), fill = "light blue", color = "white", bins = 30) +
ggtitle("Distribution of Charges") + theme_bw() + theme(plot.title = element_text(hjust = .5)) +
geom_density(col = "dark blue")
Now let’s look at charges by region:
describeBy(df$charges,df$region)
##
## Descriptive statistics by group
## group: northeast
## vars n mean sd median trimmed mad min max range
## X1 1 324 13406.38 11255.8 10057.65 11444.31 7806.78 1694.8 58571.07 56876.28
## skew kurtosis se
## X1 1.48 1.68 625.32
## ------------------------------------------------------------
## group: northwest
## vars n mean sd median trimmed mad min max range
## X1 1 325 12417.58 11072.28 8965.8 10414.54 7001.14 1621.34 60021.4 58400.06
## skew kurtosis se
## X1 1.67 2.53 614.18
## ------------------------------------------------------------
## group: southeast
## vars n mean sd median trimmed mad min max range
## X1 1 364 14735.41 13971.1 9294.13 12563.65 8749.51 1121.87 63770.43 62648.55
## skew kurtosis se
## X1 1.24 0.48 732.28
## ------------------------------------------------------------
## group: southwest
## vars n mean sd median trimmed mad min max
## X1 1 325 12346.94 11557.18 8798.59 10120.52 6329.39 1241.57 52590.83
## range skew kurtosis se
## X1 51349.26 1.67 2.03 641.08
df %>% ggplot(aes(region, charges)) + geom_boxplot(fill=c(3,4,6,7)) +
theme_bw() + ggtitle("Medical Insurance Charges by Region")
It looks like charges don’t differ much across regions, but the highest medical expenses are in the Southeast and the lowest are in the Southwest.
Let’s check out charges by smoker status:
describeBy(df$charges,df$smoker)
##
## Descriptive statistics by group
## group: no
## vars n mean sd median trimmed mad min max range
## X1 1 1064 8434.27 5993.78 7345.41 7599.76 5477.15 1121.87 36910.61 35788.73
## skew kurtosis se
## X1 1.53 3.12 183.75
## ------------------------------------------------------------
## group: yes
## vars n mean sd median trimmed mad min max
## X1 1 274 32050.23 11541.55 34456.35 31782.89 15167.19 12829.46 63770.43
## range skew kurtosis se
## X1 50940.97 0.13 -1.05 697.25
df %>% ggplot(aes(smoker, charges)) + geom_boxplot(fill=c(3,2)) +
theme_bw() + ggtitle("Medical Insurance Charges by Smoker Status")
As you would expect, smokers spend much more than non-smokers. Smokers spend about 4x more compared to non-smokers when it comes to medical insurance expenses.
Next, let’s look at sex:
describeBy(df$charges,df$sex)
##
## Descriptive statistics by group
## group: female
## vars n mean sd median trimmed mad min max range
## X1 1 662 12569.58 11128.7 9412.96 10455.16 7129.08 1607.51 63770.43 62162.92
## skew kurtosis se
## X1 1.72 2.71 432.53
## ------------------------------------------------------------
## group: male
## vars n mean sd median trimmed mad min max range
## X1 1 676 13956.75 12971.03 9369.62 11825.4 8121.53 1121.87 62592.87 61471
## skew kurtosis se
## X1 1.33 0.79 498.89
df %>% ggplot(aes(sex, charges)) + geom_boxplot(fill=c(5,7)) +
theme_bw() + ggtitle("Medical Insurance Charges by Sex")
It appears sex doesn’t seem to affect medical expenses very much.
Moving on to number of children:
describeBy(df$charges,df$children)
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max
## X1 1 574 12365.98 12023.29 9856.95 10155.21 10067.29 1121.87 63770.43
## range skew kurtosis se
## X1 62648.55 1.53 1.95 501.84
## ------------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max range
## X1 1 324 12731.17 11823.63 8483.87 10364.8 5859.46 1711.03 58571.07 56860.05
## skew kurtosis se
## X1 1.66 1.97 656.87
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range
## X1 1 240 15073.56 12891.37 9264.98 12895.82 6587.43 2304 49577.66 47273.66
## skew kurtosis se
## X1 1.28 0.35 832.13
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max
## X1 1 157 15355.32 12330.87 10600.55 13220.71 6918.06 3443.06 60021.4
## range skew kurtosis se
## X1 56578.33 1.45 1.21 984.11
## ------------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range
## X1 1 25 13850.66 9139.22 11033.66 12401.81 7109.3 4504.66 40182.25 35677.58
## skew kurtosis se
## X1 1.45 1.59 1827.84
## ------------------------------------------------------------
## group: 5
## vars n mean sd median trimmed mad min max range
## X1 1 18 8786.04 3808.44 8589.57 8402.35 3631.71 4687.8 19023.26 14335.46
## skew kurtosis se
## X1 1.04 0.54 897.66
df %>% ggplot(aes(as.factor(children), charges)) + geom_boxplot(fill=c(2:7)) +
theme_bw() + ggtitle("Medical Insurance Charges by Number of Children")
Interestingly, on average, people with 5 children spend less on medical expense than all other groups.
Let’s check out obesity now. First we need to create a new column for obese, which is greater than or equal to a BMI of 30.
df <- df %>%
mutate(obese = as.factor(if_else(bmi >= 30, "yes", "no")))
describeBy(df$charges,df$obese)
##
## Descriptive statistics by group
## group: no
## vars n mean sd median trimmed mad min max range
## X1 1 631 10713.67 7843.54 8604.48 9772.74 7024.01 1121.87 38245.59 37123.72
## skew kurtosis se
## X1 0.97 0.23 312.25
## ------------------------------------------------------------
## group: yes
## vars n mean sd median trimmed mad min max
## X1 1 707 15552.34 14552.32 9964.06 13451.03 7883.43 1131.51 63770.43
## range skew kurtosis se
## X1 62638.92 1.18 0.08 547.3
df %>% ggplot(aes(obese, charges)) + geom_boxplot(fill=c(6,4)) +
theme_bw() + ggtitle("Medical Charges by Obesity status (BMI 30+)")
As we can see, the average medical expenses for an obese individual is about $5000 more than non-obese people, while the median expenses between the group is similar.
Lastly, let’s look at age, but also with smoker status.
df %>% ggplot(aes(age, charges)) + geom_point(aes(color = smoker)) +
theme_bw() + ggtitle("Medical Charges by Age and Smoker Status") + scale_color_manual(values=c("red", "blue"))
As expected, as people get older they tend to have higher medical expenses, but adding smoker status you can see the relationship between smoking and higher medical expenses.
Alright, let’s move on to looking at correlation between variables.
num_cols <- sapply(df,is.numeric) # using only numeric features
ggpairs(df[,num_cols]) # matrix of plots
pairs.panels(df[,num_cols]) # matrix of plots using pairs.panels
We can see that age and charges have the highest correlation among the numeric variables. It’s also interesting to note that none of our numeric values are highly correlated with each other, so multicollinearity won’t be a problem.
Now let’s look at the correlation matrix for numeric features a different way.
cor_mx <- cor(df[,num_cols]) # correlation matrix
cor_mx
## age bmi children charges
## age 1.0000000 0.1092719 0.04246900 0.29900819
## bmi 0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges 0.2990082 0.1983410 0.06799823 1.00000000
corrplot(cor_mx, method = "color", title = "Correlation Matrix for Numeric Features",
mar=c(0,0,1,0))
Let’s move on and look at correlation between all features. But first we need to create dummy variables for sex and smoker.
df_dummy <- df %>%
mutate(sex_dummy = if_else(sex == "male", 1, 0)) %>%
mutate(smoker_dummy = if_else(smoker == "yes", 1, 0))
df_dummy <- df_dummy %>%
select(1,3,4,7,9,10)
# correlation all features
ggpairs(df_dummy)
pairs.panels(df_dummy)
dummy_cor_mx <- cor(df_dummy)
corrplot(dummy_cor_mx, method = "color", title = "Correlation Matrix for All Features",
mar=c(0,0,1,0))
We can see that smoking and medical expense are highly correlated.
Let’s move on to building a model using all of the original features, except using obese instead of BMI because BMI is essentially the same thing as obese.
set.seed(214)
sample <- sample.split(df$charges, SplitRatio = 0.70)
# Training Data
train = subset(df, sample == TRUE)
# Testing Data
test = subset(df, sample == FALSE)
med_model <- lm(charges ~ .-bmi, data = train)
summary(med_model)
##
## Call:
## lm(formula = charges ~ . - bmi, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12378.6 -3632.7 -411.7 1184.5 28190.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4148.16 760.41 -5.455 6.28e-08 ***
## age 269.45 14.26 18.890 < 2e-16 ***
## sexmale 57.44 402.38 0.143 0.88652
## children 502.32 164.85 3.047 0.00238 **
## smokeryes 22867.71 504.00 45.373 < 2e-16 ***
## regionnorthwest -587.99 580.23 -1.013 0.31115
## regionsoutheast -498.98 571.72 -0.873 0.38302
## regionsouthwest -1270.60 580.84 -2.188 0.02895 *
## obeseyes 3999.25 409.43 9.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6109 on 927 degrees of freedom
## Multiple R-squared: 0.7304, Adjusted R-squared: 0.7281
## F-statistic: 314 on 8 and 927 DF, p-value: < 2.2e-16
plot(med_model)
Just using the original variables we get a fairly good R-squared of 0.7304, which implies that about 73% of the variation in charges can be explained by the independent variables. Also, all variables except for region and sex, are statistically significant predictors of medical expense (p < .05).
Let’s attempt to improve the model:
First, let’s remove region and sex, as they were not statistically significant.
med_model2 <- lm(charges ~ . -bmi -sex -region, data = train)
summary(med_model2)
##
## Call:
## lm(formula = charges ~ . - bmi - sex - region, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12644.8 -3631.6 -284.8 1183.8 28264.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4711.97 646.07 -7.293 6.46e-13 ***
## age 269.60 14.26 18.900 < 2e-16 ***
## children 490.02 164.61 2.977 0.00299 **
## smokeryes 22932.73 500.57 45.813 < 2e-16 ***
## obeseyes 3975.47 402.32 9.881 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6112 on 931 degrees of freedom
## Multiple R-squared: 0.729, Adjusted R-squared: 0.7278
## F-statistic: 626.1 on 4 and 931 DF, p-value: < 2.2e-16
plot(med_model2)
This model is about the same, but slightly worse, with an R-squared of 0.729, still implying about 73% of the variation in charges can be explained by the independent variables. Let’s try a model with an interaction between obesity and smoker status.
med_model3 <- lm(charges ~ obese*smoker + age + children, data = train)
summary(med_model3)
##
## Call:
## lm(formula = charges ~ obese * smoker + age + children, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19072.0 -2025.0 -1395.7 -604.1 24448.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2834.59 503.27 -5.632 2.36e-08 ***
## obeseyes 54.04 346.65 0.156 0.876
## smokeryes 13077.17 548.52 23.841 < 2e-16 ***
## age 272.33 10.99 24.779 < 2e-16 ***
## children 609.37 126.91 4.802 1.83e-06 ***
## obeseyes:smokeryes 19481.61 771.01 25.268 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4709 on 930 degrees of freedom
## Multiple R-squared: 0.8393, Adjusted R-squared: 0.8385
## F-statistic: 971.6 on 5 and 930 DF, p-value: < 2.2e-16
plot(med_model3)
This third model is much better than the first two, with an R-squared of .8393, implying about 84% of the variation in charges can be explained by the independent variables. Our Residual vs Fitted plot appears a little better, although the Normal Q-Q plot still indicates some problems with our fit.
test$predicted <- predict(med_model3, newdata = test)
test %>%
ggplot() + geom_point(aes(predicted, charges, color = smoker, shape = obese)) +
geom_abline(color = "red") + ggtitle("Prediction vs Real Values")
It looks like our model is a pretty solid fit for our test data and our results are pretty accurate.
# calculating residuals
test$residuals <- test$charges - test$predicted
# plot residuals
test %>%
ggplot() + geom_pointrange(aes(predicted, residuals, ymin = 0, ymax = residuals, color = smoker, shape = obese)) +
geom_hline(yintercept = 0) + ggtitle("Residuals vs Fitted Values")
Overall, the residuals look pretty good, although there are several high residuals that could be caused by factors not included in the original data set.